DATA SET

This dataset comes from a consulting firm. It contains the emails exchanged from 4 to 19 March 2019 and has 1174928 lines.

Each line describes an email that a collaborator sent or received at MessDate: the collaborator either sent an email to a contact (Id_Direction is equal to 1) or received an email from a contact (Id_Direction is equal to 2).

The interlocutor can be internal, external or unidentified (PartnerTypeName).

The interaction involving the collaborator is defined by:

1.17 million lines

509954 messages (unique pairs ["Subject",'MessDate']) in the dataset.

An email can be described by several lines.
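
The relation between lines and messages can be checked with a short sketch like the one below. It uses only the Python standard library, and the rows are made-up placeholders: only the column names Subject, MessDate and Id_Direction come from the description above.

```python
from collections import Counter

# Hypothetical rows for illustration; the values are invented.
rows = [
    {"Subject": "subj-1", "MessDate": "04/03/2019 00:03", "Id_Direction": 1},
    {"Subject": "subj-1", "MessDate": "04/03/2019 00:03", "Id_Direction": 2},
    {"Subject": "subj-2", "MessDate": "04/03/2019 00:07", "Id_Direction": 2},
    {"Subject": "subj-2", "MessDate": "04/03/2019 00:07", "Id_Direction": 2},
    {"Subject": "subj-2", "MessDate": "04/03/2019 00:07", "Id_Direction": 2},
]

# A message is identified by its (Subject, MessDate) pair,
# so counting distinct pairs counts messages rather than lines.
lines_per_message = Counter((r["Subject"], r["MessDate"]) for r in rows)

n_lines = len(rows)
n_messages = len(lines_per_message)

print(f"{n_lines} lines describe {n_messages} messages")
for key, count in lines_per_message.items():
    print(key, "->", count, "line(s)")
```

Note that two different emails with the same subject sent in the same minute would be merged by this key, so the count is an approximation of the true number of emails.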

Example 1: if A and B are two internal colleagues, there will be one line dedicated to A and one line dedicated to B for the same message:

Manager | 04/03/2019 00:03 | 1 | B | A | 4bb26edab1c7a7bd212a86e4308d128af11e117c
Senior  | 04/03/2019 00:03 | 2 | A | B | 4bb26edab1c7a7bd212a86e4308d128af11e117c

Example 2: if A, B and C are three colleagues who all receive the same message from an external partner, there will be three lines with the same (Subject, MessDate) pair and Id_Direction = 2.
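
The two examples suggest one possible way to turn lines into sender-to-receiver pairs without counting an internal message twice. The sketch below is an assumption and not the processing actually used in this study: the field names "collaborator" and "contact" are hypothetical, and only Id_Direction, Subject and MessDate come from the description above.

```python
from collections import Counter

# Hypothetical rows reproducing Example 1 and Example 2.
# "collaborator" is the person the line is dedicated to, "contact" the other party.
rows = [
    # Example 1: A sends a message to internal colleague B (one line each).
    {"collaborator": "A", "contact": "B", "Id_Direction": 1,
     "Subject": "s1", "MessDate": "04/03/2019 00:03"},
    {"collaborator": "B", "contact": "A", "Id_Direction": 2,
     "Subject": "s1", "MessDate": "04/03/2019 00:03"},
    # Example 2: external partner X writes to A, B and C (three received lines).
    {"collaborator": "A", "contact": "X", "Id_Direction": 2,
     "Subject": "s2", "MessDate": "04/03/2019 09:15"},
    {"collaborator": "B", "contact": "X", "Id_Direction": 2,
     "Subject": "s2", "MessDate": "04/03/2019 09:15"},
    {"collaborator": "C", "contact": "X", "Id_Direction": 2,
     "Subject": "s2", "MessDate": "04/03/2019 09:15"},
]


def to_edge(row):
    """Return (sender, receiver, message key) for one line."""
    key = (row["Subject"], row["MessDate"])
    if row["Id_Direction"] == 1:
        # The collaborator sent the message to the contact.
        return row["collaborator"], row["contact"], key
    # The collaborator received the message from the contact.
    return row["contact"], row["collaborator"], key


# A set drops the duplicate created when both ends are internal (Example 1).
unique_edges = {to_edge(r) for r in rows}

# Number of distinct messages per (sender, receiver) pair.
edge_weights = Counter((sender, receiver) for sender, receiver, _ in unique_edges)
print(edge_weights)
```

With these rows, the two lines of Example 1 collapse into a single A-to-B pair, while Example 2 yields three pairs leaving X.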

DATA UNDERSTANDING

This section presents the background information needed to study the data set.

Roles

Interactions

DATA PROCESSING

To be able to work with the data set, we first carried out some data processing.

Creating the graph

EXPERIMENTS

Cluster

In this section, we ran experiments to identify groups within the company based on the emails exchanged between employees. Our objective is to see in detail which jobs communicate the most, which are essential to operations, etc. The idea is to identify patterns in the data that the Human Resources department can use to improve processes and performance within the company. To do so, we used K-means to estimate the number of possible clusters, and community detection techniques to detect and visualize the communities in the network. Finally, we calculated centrality measures within communities to identify the most ‘important’ employees.
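
The last step, ranking employees by centrality inside their community, can be illustrated with a minimal sketch. It assumes degree centrality (the text does not say which centrality measure was used), a toy undirected graph, and community labels that are given by hand; in the study the labels would come from the community detection step. All names in the code are hypothetical.

```python
from collections import defaultdict

# Hypothetical undirected email graph and hand-written community labels.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"),
         ("D", "E"), ("D", "F"), ("E", "F"), ("D", "G")]
community = {"A": 0, "B": 0, "C": 0, "D": 1, "E": 1, "F": 1, "G": 1}


def degree_centrality_by_community(edges, community):
    """Degree centrality of each node, computed inside its own community."""
    members = defaultdict(set)
    for node, label in community.items():
        members[label].add(node)

    degree = defaultdict(int)
    for u, v in edges:
        if community[u] == community[v]:  # keep intra-community edges only
            degree[u] += 1
            degree[v] += 1

    result = {}
    for label, nodes in members.items():
        # Normalise by the maximum possible degree inside the community.
        denom = max(len(nodes) - 1, 1)
        result[label] = {n: degree[n] / denom for n in nodes}
    return result


def top_k(scores, k=5):
    """Names of the k highest-scoring nodes, ties broken alphabetically."""
    return sorted(scores, key=lambda n: (-scores[n], n))[:k]


centrality = degree_centrality_by_community(edges, community)
top_per_community = {label: top_k(scores) for label, scores in centrality.items()}
print(top_per_community)
```

Restricting the degree to intra-community edges means that a person who mostly writes to other communities ranks low here; whether that is the desired behaviour depends on the question asked of the data.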

Community detection

Colours for each community created

K-Core

Modularity

Size of each community

Calculating centrality measures inside each community

Centrality measures in community class 0

Identify roles from the 5 highest measures in each community